When dealing with modern large datasets, the dimensionality of the observations often becomes a problem for the interpretability of the dataset or for its visualisation, especially during the data exploration phase. Principal Component Analysis (PCA) is a statistical technique used to reduce the dimension of a dataset while keeping as much information (i.e. variance) as possible from the original data. A typical use of PCA is bringing a high-dimensional dataset down to \(\mathbb{R}^2\) or \(\mathbb{R}^3\) to visualise its datapoints and subsequently identify clusters. It does so by computing principal components (PCs) from the dataset and then projecting the datapoints onto this new basis.
Mathematically speaking, we can define principal components as an orthonormal basis of unit vectors which sequentially capture the most variance in the data. Practically, the PCs can be found with the following optimization problems for a given centered Gaussian dataset \(\mathbf x_1,\ldots,\mathbf x_N \in \mathbb{R}^p\): \[ \begin{split} v_1 &= \underset{v, \| v \|=1}{\mathrm{arg\,max}} \mbox{ } v^\top \widehat{\boldsymbol{\Sigma}} v \\ v_2 &= \underset{v, \| v \|=1, v^\top v_1 = 0}{\mathrm{arg\,max}} v^\top \widehat{\boldsymbol{\Sigma}} v \\ &\;\;\vdots \end{split} \]
with \(\widehat{\Sigma} = N^{-1}\sum_{n=1}^N \mathbf x_n \mathbf x_n^\top \in \mathbb{R}^{p\times p}\) being the empirical covariance matrix of the dataset. More information about Principal component analysis can be found in Jolliffe, 2002 or directly on Wikipedia.
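To make the definitions above concrete, here is a minimal sketch that computes the PCs as the eigenvectors of \(\widehat{\boldsymbol\Sigma}\). It is written in Python/NumPy although the report's own implementation is in R (via prcomp()); all variable names are illustrative.

```python
import numpy as np

# Sketch of the optimization above: for centered data, the PCs are the
# eigenvectors of the empirical covariance, ordered by decreasing eigenvalue.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # N = 100 observations in R^5
X = X - X.mean(axis=0)                     # center the data

Sigma_hat = X.T @ X / X.shape[0]           # empirical covariance (N^-1 convention)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigvals)[::-1]          # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1 = eigvecs[:, 0]                         # first principal component direction
scores = X @ eigvecs[:, :2]                # projection onto the first two PCs
```

The sequential constrained maximizations in the display above reduce to this single eigendecomposition because \(\widehat{\boldsymbol\Sigma}\) is symmetric.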
Figure 1.1: Example of 2 PCs on a given dataset in \(\mathbb{R}^2\)
This project focuses on finding the optimal number of principal components (\(r\)), which dictates a trade-off between the model’s complexity and its interpretability: the higher \(r\), the higher the explained variance, but the less interpretable the representation. To choose the number of PCs optimally, several techniques have been developed, such as the method of percentage of variance explained, the scree-plot method, or cross-validation for PCA.
In this report, one will first find a presentation of each cross-validation method for PCA, with an initial description followed by pseudo-code, in [2. Cross-validation on PCA methods]. Then, those methods will be run on simulated datasets with different parameters in [3. Simulation and comparison of CV methods]. Furthermore, the percentage of variance explained and the scree-plot methods will be described and tested in [4. Other PCs number selection methods]. Finally, the conclusion will wrap up the take-home messages as well as next steps which could be taken regarding this subject.
A widely used approach for cross-validation in PCA is a method that has the right intentions but fails at a critical point: it does not take into account any bias-variance tradeoff, meaning that the optimal solution will always be \(r = R\). We start with the usual conventions by denoting \(X \in \mathbb{R}^{n \times p}\) our data with \(n\) observations and \(p\) variables; then, as presented in Notes 06 - T. Masak, below is the pseudo-code of the naïve approach:
A numerical way to solve the least-squares problem at the top of the algorithm is to compute the SVD of \(\mathbf X[J_k^c,:]\) and truncate it to rank \(r\) by setting the \(p-r\) smallest singular values to zero. To estimate \(\mathbf{x}^{m}, \forall m \in J_k\), with a lower-dimensional version, we project it with the matrix \(P_{\mathbf{\widehat{L}}}\), which can be computed via the R built-in function prcomp(); among other objects, it returns a basis for the column space of \(\widehat{\mathbf L}\). The promised issue occurs when the error measure is computed without further adjustment: the error shrinks as \(\widehat{\mathbf L}\) is allowed to have higher dimensions and converges to \(0\) when its number of dimensions equals that of the initial data set \(\mathbf{X}\), since the SVD is then an exact representation of the matrix \(\mathbf{X}\).
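The rank truncation described above can be sketched as follows (an illustrative Python/NumPy stand-in for the R implementation; the helper name is ours):

```python
import numpy as np

# Best rank-r approximation via truncated SVD: keep the r largest singular
# values and zero out the p - r smallest ones.
def truncate_rank(A, r):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s[r:] = 0.0                       # set the smallest singular values to zero
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 8))
A2 = truncate_rank(A, 3)              # best rank-3 approximation of A
```

As noted above, the reconstruction error decreases monotonically in \(r\) and reaches \(0\) at full rank, which is exactly why the naive criterion always prefers \(r = R\).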
This method corrects the [2.1 Naïve Approach] by introducing a bias-variance trade-off in the \(Err_k\) function. Let again \(\mathbf{X} \in \mathbb{R}^{n \times p}\) be our multivariate Gaussian dataset with \(n\) observations and \(p\) variables, mean \(\mu\) and covariance \(\Sigma\). As described in Notes 06 - T. Masak, we implement this method as follows:
It is important to emphasize some points of the aforementioned algorithm. When we truncate \(\hat{\Sigma}\) to \(\hat{\Sigma}_r\) in every fold, we do it by eigenvalue decomposition: we store \(\alpha = \sum_{i=r+1}^p \lambda_i\) and then set the \(p-r\) smallest eigenvalues to zero. Afterwards, we compute \(\hat{\Sigma}_r = U\tilde{\Lambda}U^\top\), with \(\tilde{\Lambda}\) the truncated diagonal matrix of eigenvalues, and add \(\alpha / p\) to its diagonal. This is done to smooth the drop of eigenvalues caused by truncating the covariance estimator and to avoid undesirable results in our algorithms.
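A sketch of this truncation-plus-smoothing step (illustrative Python/NumPy; the report's implementation is in R, and the function name is ours):

```python
import numpy as np

# Zero out the p - r smallest eigenvalues of the covariance estimate, then
# spread their total mass alpha evenly over the diagonal.
def truncate_cov(Sigma_hat, r):
    eigvals, U = np.linalg.eigh(Sigma_hat)   # ascending eigenvalues
    eigvals, U = eigvals[::-1], U[:, ::-1]   # reorder to decreasing
    p = Sigma_hat.shape[0]
    alpha = eigvals[r:].sum()                # mass of the discarded eigenvalues
    lam = eigvals.copy()
    lam[r:] = 0.0
    Sigma_r = (U * lam) @ U.T                # U diag(lam) U^T
    return Sigma_r + (alpha / p) * np.eye(p) # smooth the eigenvalue drop
```

Note that this correction preserves the trace of \(\hat{\Sigma}\), so the total variability of the estimator is unchanged.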
Hence, we reduce the rank of the covariance of \(\mathbf{X}\) but preserve its dimension, which allows us to estimate \(\mathbf{x}^m_{miss}\) in the later part of the algorithm. When splitting the observations in every iteration of the for loop, the question naturally arises whether we should always take the same split or a different (random) one. The first option was chosen, as the additional random factor would be negligible by the SLLN. Nevertheless, it makes sense to keep two equally sized parts in order to have a proper estimation basis without overestimating. This also gives us a good understanding of whether our estimation method works in harsh conditions, i.e. when half of the data is missing.
When estimating \(\hat{\mathbf{x}}^m_{miss}\), the linear estimator is used as it emerges from the law of our data set \(\mathbf{X}\); hence, it is only applicable in this form to Gaussian-distributed observations \(\mathbf{x} \in \mathbb{R}^p\). The estimation also makes use of the pseudoinverse of \(\hat{\Sigma}_{obs,obs}\), denoted by \(\hat{\Sigma}^\dagger_{obs}\). We use this convention in order to avoid singularity issues when inverting a block of the truncated covariance estimator. The pseudoinverse is easily callable in R via the function ginv() from the MASS package.
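The conditional-mean estimator can be sketched as follows, with np.linalg.pinv playing the role of MASS::ginv() (an illustrative Python/NumPy stand-in; index conventions and names are ours):

```python
import numpy as np

# Gaussian conditional mean of the missing block given the observed block:
# E[x_mis | x_obs] = mu_mis + Sigma_{mis,obs} Sigma_{obs,obs}^+ (x_obs - mu_obs)
def impute_missing(x, obs, mis, mu, Sigma):
    S_oo = Sigma[np.ix_(obs, obs)]            # observed-observed block
    S_mo = Sigma[np.ix_(mis, obs)]            # missing-observed block
    return mu[mis] + S_mo @ np.linalg.pinv(S_oo) @ (x[obs] - mu[obs])
```

For example, with \(\mu = 0\) and correlation \(0.8\) between two variables, observing \(x_1 = 2\) yields the prediction \(0.8 \cdot 2 = 1.6\) for \(x_2\).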
Here it is assumed that the rows \(\mathbf{X}_{n} \in \mathbb{R}^p, n=1,\dots,N\), of the data matrix \(\mathbf{X}=(\mathbf{X}_{n,j})^{N,p}_{n,j=1}\) are i.i.d. realizations of a multivariate Gaussian random variable with mean \(\mu \in \mathbb{R}^p\) and covariance \(\Sigma \in \mathbb{R}^{p \times p}\). We assume that rank\((\Sigma)=r\) for \(r=1,\dots,R\) and compare how well different values of \(r\) fit the data. As described in Notes 06 - T. Masak, this approach is based on computing the estimators \(\hat{\Sigma}\) and \(\hat{\mu}\) from the observed data by the EM algorithm, while the sets of missing and observed entries are randomly selected at the beginning of the algorithm. It then performs more or less the same steps as the previous algorithm to find the optimal number of PCs. Let us now describe the EM algorithm used in our case and then the missing-data approach itself.
Let \(\tilde{X}\) be the dataset containing missing values randomly placed on the original \(N \times p\) dataset. The EM algorithm works as follows:
This method slightly differs from the one presented in Notes 05 - T. Masak, mainly due to poor results when estimating \(\hat{\Sigma}\) and slight confusion about the notation used in the aforementioned implementation. Another matrix-based approach for the EM algorithm, from Machine Learning, Murphy, was also tried; it gave good results for \(\hat{\mu}\) but was likewise unsuccessful in predicting \(\hat{\Sigma}\) when \(r\) becomes too large.
Below is the pseudo-code for the Missing Data method:
As one can observe, this method is quite similar to the previous one, except for the computation of the missing values and the parameter estimation. Here, the missing values are randomly selected across the dataset and can be seen as holes in it, while for the previous CV method whole columns (i.e. variables) were treated as missing. Also, as we are now dealing with holes in the dataset, a fully defined estimator for \(\Sigma\) was not guaranteed, in contrast to the other method, where we could easily select the observed and missing blocks of the covariance. This forced us to use the EM algorithm for this CV method, whereas for the other one the simple built-in mean() and cov() functions did the job.
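For illustration, here is a minimal EM sketch for \((\hat{\mu}, \hat{\Sigma})\) with randomly missing entries. This is a generic textbook variant written in Python/NumPy, not the report's exact R implementation; all names are ours.

```python
import numpy as np

# EM for a multivariate Gaussian with entries missing at random.
def em_gaussian(X, n_iter=50):
    X = X.copy()
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)                 # init: column means of observed entries
    X[miss] = np.take(mu, np.where(miss)[1])   # fill the holes with those means
    Sigma = np.cov(X, rowvar=False, bias=True)
    for _ in range(n_iter):
        C = np.zeros_like(Sigma)               # accumulated conditional covariances
        for n in range(X.shape[0]):
            m = miss[n]
            if not m.any():
                continue
            o = ~m
            S_oo_inv = np.linalg.pinv(Sigma[np.ix_(o, o)])
            S_mo = Sigma[np.ix_(m, o)]
            # E-step: conditional mean of the missing block given the observed one
            X[n, m] = mu[m] + S_mo @ S_oo_inv @ (X[n, o] - mu[o])
            # ... and its conditional covariance, added back in the M-step
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - S_mo @ S_oo_inv @ S_mo.T
        # M-step: re-estimate the parameters from the completed data
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False, bias=True) + C / X.shape[0]
    return mu, Sigma
```

The correction term \(C\) is what distinguishes EM from naive mean imputation: it accounts for the variance of the imputed entries when updating \(\hat{\Sigma}\).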
This approach is a modification of the KDE method from Notes 06 - T. Masak. Let, as usual, \(\mathbf{X}\) be a multivariate Gaussian dataset of size \(n \times p\). The goal is to minimize w.r.t. \(R\) the following quantity:
\[ \mathbb{E} \| \widehat{C}_R - C \|_F^2 = \mathbb{E}\| \widehat{C}_R \|_F^2 - 2 \mathbb{E} \langle \widehat{C}_R, C \rangle + \| C \|_F^2 \]
with \(\widehat{C}_R\) being the truncated covariance estimator of rank \(R\), \(C\) the true covariance matrix, and \(\|\cdot\|_F\) the Frobenius norm, while \(\langle X, Y \rangle := \sum_{i,j}X_{ij}Y_{ij}\) denotes the sum of all elements of the Hadamard product of the two matrices \(X,Y\). Assuming that the mean of \(\mathbf{X}\) is zero, we get: \[ \mathbb{E} \langle \widehat{C}_R^{(-n)}, X_n X_n^\top \rangle = \langle \underbrace{\mathbb{E}[ \widehat{C}_R^{(-n)} ]}_{\approx\, \mathbb{E} \widehat{C}_R}, \underbrace{\mathbb{E} [X_n X_n^\top]}_{=\, C} \rangle \approx \mathbb{E} \langle \widehat{C}_R , X_n X_n^\top \rangle \]
Since the truncated covariance estimator is not linear, we only obtain an approximation, which we plug back into the initial quantity we wish to minimize. Discarding the term \(\| C \|_F^2\), which does not depend on \(R\), leaves a criterion whose remaining terms do depend on \(R\) and can therefore be compared across candidate ranks. Overall, we implement this method as follows for every rank \(R\):
Stepping into the for-loop, we realize that it resembles leave-one-out cross-validation. In this method we trade a high computational cost for a stable solution. As the simulations show, we often obtain negative values for \(Err_R\), since one of its components, \(\|C\|_F^2\), cannot be computed; it is, however, not needed to find the optimal \(R\).
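The criterion can be sketched as follows (Python/NumPy, illustrative names; the report works in R). As discussed, \(Err_R\) omits the constant \(\|C\|_F^2\) and can therefore be negative without affecting the ranking of candidate ranks:

```python
import numpy as np

# Truncated covariance estimator of rank R (data assumed centered).
def truncated_cov(X, R):
    S = X.T @ X / X.shape[0]          # empirical covariance
    lam, U = np.linalg.eigh(S)        # ascending eigenvalues
    lam[:lam.size - R] = 0.0          # zero out the p - R smallest
    return (U * lam) @ U.T

# Err_R = ||C_R||_F^2 - 2 * leave-one-out estimate of E<C_R, X X^T>.
def err_R(X, R):
    n = X.shape[0]
    C_R = truncated_cov(X, R)
    cross = 0.0
    for i in range(n):                # leave-one-out loop
        C_loo = truncated_cov(np.delete(X, i, axis=0), R)
        cross += np.sum(C_loo * np.outer(X[i], X[i]))
    return np.sum(C_R**2) - 2.0 * cross / n
```

One then picks the \(R\) minimizing \(Err_R\); the leave-one-out loop is what makes the method computationally expensive.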
This final method is somewhat more general, as we do not need to assume any probability distribution for the data \(\mathbf{X} \in \mathbb{R}^{N\times p}\), only that \(\mathbf{X}\) has low rank. Indeed, for a given data matrix \(\mathbf{X}\), we assume that a portion of the matrix is missing and that the remaining elements are observed. Let \(\Omega\) be the bivariate index set of the observed entries of \(\mathbf{X}\). The matrix completion method consists of finding a matrix \(\mathbf M\) solving:
\[ \mathrm{arg \, min}_{\mathbf{M}=(M_{ij})_{i,j=1}^{N \times p}} \sum_{(i,j) \in \Omega} (X_{ij} - M_{ij})^2 \quad \mathrm{s.t.} \quad \mathrm{rank}(\mathbf M)=R \]
This optimization problem gives us the best rank-\(R\) matrix \(\mathbf M\) w.r.t. the least-squares error on the observed entries. However, finding the solution of such an optimization problem is \(NP\)-hard, so one can use the following iterative algorithm to find locally optimal candidates for the matrix \(\mathbf M\):
We decided to multiply the initial stopping criterion \(\|M^{(l)}-M^{(l-1)}\|_F\) by \(\|M^{(l)}\|_F\) to take into account the magnitude of the elements of \(M\), and added the condition \(l<1000\) in case the algorithm does not converge to the desired tolerance in a reasonable amount of time. In other words, the algorithm stops and returns the matrix \(M^{(l)}\) if there was a change of less than \(0.01\%\) from the previous matrix \(M^{(l-1)}\), or if we exceed \(1000\) iterations.
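The iterative scheme with this stopping rule can be sketched as follows (Python/NumPy; function name and defaults are ours, with the default tolerance matching the \(0.01\%\) relative change above):

```python
import numpy as np

# Iterative hard-impute: alternate between a rank-R SVD projection and
# re-fixing the observed entries, until the relative change is below tol.
def complete_matrix(X, R, tol=1e-4, max_iter=1000):
    observed = ~np.isnan(X)
    M = np.where(observed, X, np.nanmean(X))     # initialize holes with global mean
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s[R:] = 0.0                              # project onto rank-R matrices
        M_new = U @ np.diag(s) @ Vt
        M_new = np.where(observed, X, M_new)     # keep observed entries fixed
        if np.linalg.norm(M_new - M) < tol * np.linalg.norm(M_new):
            return M_new
        M = M_new
    return M
```

On an exactly low-rank matrix with a few holes, this alternating projection typically recovers the missing entries almost exactly.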
After [2. Cross-validation on PCA methods] we are equipped with multiple approaches for recovering the number of dimensions needed to represent a data set accurately. Intuitively, every method has its own advantages and drawbacks. In order to get a better feeling for how our cross-validation methods behave in different circumstances, we run a simulation study on a variety of synthesized data sets. This allows us to discover the limits of the different methods and to recommend algorithms with respect to the underlying data. By construction, every one of our algorithms besides the [2.1 Naïve Approach] and the [2.5 Matrix completion method] is only applicable to multivariate Gaussian data. We therefore restrict ourselves to this special distribution and create corresponding data in order to get meaningful results.

All in all, we consider \(12\) different data sets and compare the results of our five algorithms. We only consider centered multivariate Gaussian random variables and create three base data sets, on which we impose different kinds of noise. The first base data set is a standard multivariate Gaussian sample, while the second and third rely on a random and a structured covariance matrix, respectively. We denote the data sets by \(D^{i}_0, D^{i}_1\) and \(D^{i}_2\), with \(i \in \{1,\dots,3\}\) indexing the noise; each consists of data \(X=\{X^1,\dots,X^n\}\) and noise \(\epsilon^{i}\). The number of different parameters makes it difficult to present a single big picture of the problem, which is why we implemented a Shiny App that lets the reader tune the parameters as they please. We strongly recommend checking it out and testing the limits of cross-validation for oneself.
\[ D^{i}_0 = X + \epsilon^{i}_p, \text{ where } X^j\sim\mathcal{N}(0, \mathbb{1}_{p\times p}), \forall j = 1,\dots,n\\ D^{i}_1 = X + \epsilon^{i}_p, \text{ where } X^j\sim\mathcal{N}(0, \Sigma_1), \forall j = 1,\dots,n \text{ and } \Sigma_1=M M^{\top}, (M_{ij})_{1\leq i,j\leq p} \stackrel{\text{iid}}{\sim} \mathcal{N}(0,1)\\ D^{i}_2 = X + \epsilon^{i}_p, \text{ where } X^j\sim\mathcal{N}(0, \Sigma_2), \forall j = 1,\dots,n \text{ and } \Sigma_2=M^{\top} M, (M_{ij})_{1\leq i,j\leq p}=\frac{(i-1)p+j}{p^2} \]
For the covariance \(\Sigma_2\) of \(D^{i}_2\) we get a structured matrix that puts the most weight on the last variables. For clarity, we give an example for \(p=3\):
\[ M = \frac{1}{9}\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ \end{pmatrix} \implies \Sigma_2 = \begin{pmatrix} 0.81 & 0.96 & 1.11 \\ 0.96 & 1.15 & 1.33 \\ 1.11 & 1.33 & 1.56 \\ \end{pmatrix} \]
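A quick numerical check of this example (Python/NumPy); note that reproducing the displayed \(\Sigma_2\) requires forming \(M^\top M\) with the entries of \(M\) divided by \(p^2\):

```python
import numpy as np

# Reproduce the p = 3 example: M = (1/9) * [[1..9]] and Sigma_2 = M^T M.
p = 3
M = np.arange(1.0, p * p + 1).reshape(p, p) / p**2
Sigma_2 = M.T @ M
print(np.round(Sigma_2, 2))
```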
In order to have valid comparisons across all of our methods and to validate their correctness, we modify \(X\). First, by the random construction of \(X\), we can assume that it has full rank \(p\). Second, we truncate it to a specific rank \(r\), which our algorithms should then recover. Since \(X\) is not symmetric, we do so by performing a singular value decomposition of \(X\) with the built-in svd() function and setting the \(p-r\) smallest singular values to \(0\). This enables us to control the rank of \(X\). Note that \(D_2\) already has rank \(2\) before adding noise. Having adapted \(X\), we add either uniform noise of high or low variance, or noise whose variance differs across variables.
\[ \epsilon^{1}_p\stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma^2\mathbb{1}_{p\times p}), \sigma^2=1.5 \\ \epsilon^{2}_p\stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma^2\mathbb{1}_{p\times p}), \sigma^2=0.2 \\ \epsilon^{3}_p\stackrel{\text{iid}}{\sim}\mathcal{N}(0, U), U_{ii}\stackrel{\text{iid}}{\sim} \mathcal{U}([0,2]) \text{ and } U_{ij}=0, \text{for } i\neq j \\ \]
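The data generation above can be sketched as follows for the base data set \(D_0\) (a Python/NumPy stand-in; the report's simulations are run in R with the parameters stated in the next paragraph, and all names here are ours):

```python
import numpy as np

# Base Gaussian sample, truncated to rank r via SVD, plus the three noises.
rng = np.random.default_rng(1312)
n, p, r = 100, 8, 3

X = rng.standard_normal((n, p))                  # D_0 base: N(0, I_p) rows
U, s, Vt = np.linalg.svd(X, full_matrices=False)
s[r:] = 0.0                                      # truncate the data to rank r
X = U @ np.diag(s) @ Vt

eps1 = rng.normal(0.0, np.sqrt(1.5), size=(n, p))   # noise 1: high, uniform
eps2 = rng.normal(0.0, np.sqrt(0.2), size=(n, p))   # noise 2: low, uniform
var3 = rng.uniform(0.0, 2.0, size=p)                # noise 3: per-variable variances
eps3 = rng.standard_normal((n, p)) * np.sqrt(var3)

D0_1, D0_2, D0_3 = X + eps1, X + eps2, X + eps3
```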
We run our simulation with \(n=100\) observations per data set and \(p=8\) variables. We truncate the data sets to rank \(r=3\) and repeat our cross-validation \(sim=20\) times with seed \(1312\). The methods using classical cross-validation loop over \(K=5\) folds.
For this analysis, we focus on \(D_0 \in \mathbb{R}^{N \times p}\) and vary the noise \(\epsilon^i_p\) for \(i \in \{1,2,3\}\). Hereafter are the errors for each CV method described in Section 2.
Figure 3.1: Comparison of CV methods on base 0 dataset with different noises as indicated by Noise n°1, 2 and 3
Figure 3.2: KDE Approach for base 0 dataset with different noises as indicated by Noise n°1, 2 and 3
Figure 3.3: Comparison of CV methods on base 1 dataset with different noises as indicated by Noise n°1, 2 and 3
Figure 3.4: KDE Approach for base 1 dataset with different noises as indicated by Noise n°1, 2 and 3
Figure 3.5: Comparison of CV methods on base 2 dataset with different noises as indicated by Noise n°1, 2 and 3
Figure 3.6: KDE Approach for base 2 dataset with different noises as indicated by Noise n°1, 2 and 3
Figure 3.7: Comparison of CV methods on different dataset for Noise n°1 (high noise)
Figure 3.8: Comparison of CV methods on different dataset for Noise n°2 (low noise)
Figure 3.9: Comparison of CV methods on different dataset for Noise n°3 (differing noise)
The percentage of variance explained method is based on the eigenvalues of the dataset’s covariance matrix and on the total variability (the sum of the diagonal entries of the covariance matrix). When computing the PCs for a given dataset \(\mathbf{X} \in \mathbb{R}^{N \times p}\) with covariance \(\Sigma\), we create an orthonormal basis with a corresponding diagonal covariance matrix \(\Sigma_{PCA}\), whose entries are the eigenvalues of the corresponding PCs. As the components are constructed sequentially, the eigenvalues \(\lambda_1, \lambda_2, \dots, \lambda_p \in \mathbb{R}\) of the new matrix form a decreasing sequence. The method of percentage of variance explained simply selects the first \(r\) components such that \(\frac{1}{V}\sum_{i=1}^r \lambda_i \ge \tau\) for an arbitrary threshold \(\tau \in [0,1]\) and total variability \(V = \sum_{i=1}^p \lambda_i = tr(\Sigma_{PCA})\). Below are a few examples of the method applied to \(D_0\):
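The selection rule can be sketched as follows (Python/NumPy, illustrative names; the report's implementation is in R):

```python
import numpy as np

# Pick the smallest r whose leading eigenvalues reach a fraction tau
# of the total variability V.
def select_r(eigvals, tau=0.9):
    lam = np.sort(eigvals)[::-1]                 # decreasing eigenvalues
    explained = np.cumsum(lam) / lam.sum()       # cumulative share of variance
    return int(np.searchsorted(explained, tau) + 1)

select_r(np.array([5.0, 3.0, 1.0, 0.5, 0.5]), tau=0.8)   # → 2
```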
Figure 4.1: Eigenvalues of the PCA covariance matrix, with a threshold of 90% for \(D_0\) w.r.t. noise
The scree plot is again based on the eigenvalues of the newly created orthonormal basis of PCs which, as shown above, form a decreasing sequence. The scree-plot method consists of choosing the value of \(r\) that corresponds to the elbow of the eigenvalue graph. With \(100\) observations and \(8\) variables we get an overview of the extent to which every single eigenvalue contributes to the variance of the data set as a whole.
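Since reading the elbow off a scree plot is a manual step, a crude automatic stand-in, e.g. picking \(r\) just before the largest consecutive eigenvalue drop, can illustrate the idea (Python/NumPy; this hypothetical helper is ours, not part of the report's method):

```python
import numpy as np

# Choose r just before the largest consecutive drop in the eigenvalues.
def elbow_r(eigvals):
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    drops = lam[:-1] - lam[1:]          # consecutive decreases
    return int(np.argmax(drops) + 1)

elbow_r([6.0, 5.5, 5.2, 1.0, 0.8, 0.7])   # → 3: the big drop comes after λ3
```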
Figure 4.2: Scree plot on \(D_0\) with different noises as indicated by 1,2 and 3
We can clearly see that in each plot the first three singular values are much bigger than the later ones. In every plot the first three singular values decrease only slightly until the \(4\)th, where we observe a strong drop; from the \(4\)th on, the values decrease in the same manner as the first \(3\). In these circumstances the [Scree-plot method (non-automatic)] would favour the \(4\)th value, as it initiates the elbow characteristic in every single one of the graphs. This selection would unfortunately be the wrong one since, by construction (i.e. the truncation of our data set), we expected the third singular value to initiate the “elbow”.
As seen below, the scree plot of our data set with a random matrix as covariance has a somewhat similar shape. The first eigenvalue, however, explains much more of the overall variance of the data set. It drops strongly until the second value, where the graph seems to stabilize. This behaviour persists until the third singular value, after which we observe a drop and decrease similar to those in the previous plot. As already concluded for Figure 4.2, the [Scree-plot method (non-automatic)] would favour a wrong value.
Figure 4.3: Scree plot on \(D_1\) with a random matrix as covariance, with different noises as indicated by 1, 2 and 3
The scree plot of \(D_2\), however, looks more promising, as we get a sharp drop after the first value for every kind of added noise. Choosing the optimal rank \(r\) according to the [Scree-plot method (non-automatic)] would therefore coincide with the true rank of the data set, which was \(2\) by construction, regardless of the number of observations and variables.
Figure 4.4: Scree plot on \(D_2\) with structured matrix as covariance with different noises as indicated by 1,2 and 3
Compared to the CV methods described in [2. Cross-validation on PCA methods], these methods are relatively easy to implement, efficient (the only computationally expensive step is the eigenvalue decomposition of the covariance matrix) and quite interpretable. However, they do not generalize well when there are missing values in the dataset or, for the scree plot, when the eigenvalues do not follow a clear elbow pattern.
In conclusion, several methods for choosing the optimal number of components have been presented. Thanks to simulations on various datasets, we could observe the behavior of those CV methods depending on the noise and on the structure of the covariance matrices. We concluded that some methods are more robust to noise (EXAMPLE) while others tend to give poor results when confronted with noise (EXAMPLE). Moreover, we had to assume that our dataset was a centered multivariate Gaussian for every method to be comparable; however, some methods generalize better to other data distributions (Matrix Completion, for example) while others are more restrictive (EXAMPLE). Computation-wise, some methods are more efficient than others (EXAMPLE) and, assuming all conditions for their use are met, the most efficient ones are therefore favoured when working with bigger datasets.